MLOps - End-to-End Machine Learning Workflow with KizenML, XAI, and Cloud Deployment


Group Members:

SAKTHIVEL G 2022AC05688
ADITYA GAURAV 2022ac05101
BHUVANJEET SINGH GANDHI 2022ac05606
KALRA GEETANSH 2022ac05174

No contribution from Harish Bhagwan Raggal and Neeraj Gupta (tried contacting, no response also)

Group 18 - MLOps Assignment 2

1. Data Collection and Preprocessing¶

1.1 Install Required Libraries¶

The command installs three key libraries necessary for our assignment:

  • dataprep: Used for data preparation and exploratory data analysis.
  • shap: A tool for model interpretability, providing global and local explanations of model predictions.
  • lime: Another model interpretability tool that offers local explanations for individual predictions.

We are using the auto-sklearn Docker image from the official site to run this notebook, which already includes most of the dependencies on auto-sklearn.

In [22]:
! pip install dataprep shap lime
Requirement already satisfied: dataprep in /usr/local/lib/python3.8/dist-packages (0.4.5)
Requirement already satisfied: shap in /usr/local/lib/python3.8/dist-packages (0.44.1)
Requirement already satisfied: lime in /usr/local/lib/python3.8/dist-packages (0.2.0.1)
Requirement already satisfied: pydot<2.0.0,>=1.4.2 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.4.2)
Requirement already satisfied: sqlalchemy==1.3.24 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.3.24)
Requirement already satisfied: rapidfuzz<3.0.0,>=2.1.2 in /usr/local/lib/python3.8/dist-packages (from dataprep) (2.15.2)
Requirement already satisfied: aiohttp<4.0,>=3.6 in /usr/local/lib/python3.8/dist-packages (from dataprep) (3.10.5)
Requirement already satisfied: tqdm<5.0,>=4.48 in /usr/local/lib/python3.8/dist-packages (from dataprep) (4.66.5)
Requirement already satisfied: python-crfsuite==0.9.8 in /usr/local/lib/python3.8/dist-packages (from dataprep) (0.9.8)
Requirement already satisfied: dask[array,dataframe,delayed]>=2022.3.0 in /usr/local/lib/python3.8/dist-packages (from dataprep) (2022.9.1)
Requirement already satisfied: regex<2022.0.0,>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from dataprep) (2021.11.10)
Requirement already satisfied: pandas<2.0,>=1.1 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.5.0)
Requirement already satisfied: pydantic<2.0,>=1.6 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.10.18)
Requirement already satisfied: ipywidgets<8.0,>=7.5 in /usr/local/lib/python3.8/dist-packages (from dataprep) (7.8.4)
Requirement already satisfied: jsonpath-ng<2.0,>=1.5 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.6.1)
Requirement already satisfied: bokeh<3,>=2 in /usr/local/lib/python3.8/dist-packages (from dataprep) (2.4.3)
Requirement already satisfied: scipy<2.0,>=1.8 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.9.1)
Requirement already satisfied: jinja2<3.1,>=3.0 in /usr/local/lib/python3.8/dist-packages (from dataprep) (3.0.3)
Requirement already satisfied: nltk<4.0.0,>=3.6.7 in /usr/local/lib/python3.8/dist-packages (from dataprep) (3.9.1)
Requirement already satisfied: numpy<2.0,>=1.21 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.23.3)
Requirement already satisfied: wordcloud<2.0,>=1.8 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.9.3)
Requirement already satisfied: python-stdnum<2.0,>=1.16 in /usr/local/lib/python3.8/dist-packages (from dataprep) (1.20)
Requirement already satisfied: flask_cors<4.0.0,>=3.0.10 in /usr/local/lib/python3.8/dist-packages (from dataprep) (3.0.10)
Requirement already satisfied: varname<0.9.0,>=0.8.1 in /usr/local/lib/python3.8/dist-packages (from dataprep) (0.8.3)
Requirement already satisfied: metaphone<0.7,>=0.6 in /usr/local/lib/python3.8/dist-packages (from dataprep) (0.6)
Requirement already satisfied: flask<3,>=2 in /usr/local/lib/python3.8/dist-packages (from dataprep) (2.2.5)
Requirement already satisfied: slicer==0.0.7 in /usr/local/lib/python3.8/dist-packages (from shap) (0.0.7)
Requirement already satisfied: packaging>20.9 in /usr/local/lib/python3.8/dist-packages (from shap) (21.3)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (from shap) (0.24.2)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.8/dist-packages (from shap) (2.2.0)
Requirement already satisfied: numba in /usr/local/lib/python3.8/dist-packages (from shap) (0.58.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.8/dist-packages (from lime) (3.6.0)
Requirement already satisfied: scikit-image>=0.12 in /usr/local/lib/python3.8/dist-packages (from lime) (0.21.0)
Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (1.3.1)
Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (1.11.1)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (22.1.0)
Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (4.0.3)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (6.1.0)
Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (2.4.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from aiohttp<4.0,>=3.6->dataprep) (1.4.1)
Requirement already satisfied: PyYAML>=3.10 in /usr/local/lib/python3.8/dist-packages (from bokeh<3,>=2->dataprep) (6.0)
Requirement already satisfied: tornado>=5.1 in /usr/local/lib/python3.8/dist-packages (from bokeh<3,>=2->dataprep) (6.1)
Requirement already satisfied: pillow>=7.1.0 in /usr/local/lib/python3.8/dist-packages (from bokeh<3,>=2->dataprep) (9.2.0)
Requirement already satisfied: typing-extensions>=3.10.0 in /usr/local/lib/python3.8/dist-packages (from bokeh<3,>=2->dataprep) (4.3.0)
Requirement already satisfied: fsspec>=0.6.0 in /usr/local/lib/python3.8/dist-packages (from dask[array,dataframe,delayed]>=2022.3.0->dataprep) (2022.8.2)
Requirement already satisfied: toolz>=0.8.2 in /usr/local/lib/python3.8/dist-packages (from dask[array,dataframe,delayed]>=2022.3.0->dataprep) (0.12.0)
Requirement already satisfied: partd>=0.3.10 in /usr/local/lib/python3.8/dist-packages (from dask[array,dataframe,delayed]>=2022.3.0->dataprep) (1.3.0)
Requirement already satisfied: Werkzeug>=2.2.2 in /usr/local/lib/python3.8/dist-packages (from flask<3,>=2->dataprep) (3.0.4)
Requirement already satisfied: itsdangerous>=2.0 in /usr/local/lib/python3.8/dist-packages (from flask<3,>=2->dataprep) (2.2.0)
Requirement already satisfied: importlib-metadata>=3.6.0 in /usr/local/lib/python3.8/dist-packages (from flask<3,>=2->dataprep) (4.12.0)
Requirement already satisfied: click>=8.0 in /usr/local/lib/python3.8/dist-packages (from flask<3,>=2->dataprep) (8.1.3)
Requirement already satisfied: Six in /usr/local/lib/python3.8/dist-packages (from flask_cors<4.0.0,>=3.0.10->dataprep) (1.16.0)
Requirement already satisfied: ipython>=4.0.0 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (8.5.0)
Requirement already satisfied: ipython-genutils~=0.2.0 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (0.2.0)
Requirement already satisfied: comm>=0.1.3 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (0.2.2)
Requirement already satisfied: jupyterlab-widgets<3,>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (1.1.10)
Requirement already satisfied: traitlets>=4.3.1 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (5.4.0)
Requirement already satisfied: widgetsnbextension~=3.6.9 in /usr/local/lib/python3.8/dist-packages (from ipywidgets<8.0,>=7.5->dataprep) (3.6.9)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.8/dist-packages (from jinja2<3.1,>=3.0->dataprep) (2.1.1)
Requirement already satisfied: ply in /usr/local/lib/python3.8/dist-packages (from jsonpath-ng<2.0,>=1.5->dataprep) (3.11)
Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk<4.0.0,>=3.6.7->dataprep) (1.2.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/dist-packages (from packaging>20.9->shap) (3.0.9)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.8/dist-packages (from pandas<2.0,>=1.1->dataprep) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.8/dist-packages (from pandas<2.0,>=1.1->dataprep) (2022.2.1)
Requirement already satisfied: tifffile>=2022.8.12 in /usr/local/lib/python3.8/dist-packages (from scikit-image>=0.12->lime) (2023.7.10)
Requirement already satisfied: networkx>=2.8 in /usr/local/lib/python3.8/dist-packages (from scikit-image>=0.12->lime) (3.1)
Requirement already satisfied: imageio>=2.27 in /usr/local/lib/python3.8/dist-packages (from scikit-image>=0.12->lime) (2.35.1)
Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.8/dist-packages (from scikit-image>=0.12->lime) (1.4.1)
Requirement already satisfied: lazy_loader>=0.2 in /usr/local/lib/python3.8/dist-packages (from scikit-image>=0.12->lime) (0.4)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn->shap) (3.1.0)
Requirement already satisfied: pure_eval<1.0.0 in /usr/local/lib/python3.8/dist-packages (from varname<0.9.0,>=0.8.1->dataprep) (0.2.2)
Requirement already satisfied: executing<0.9.0,>=0.8.3 in /usr/local/lib/python3.8/dist-packages (from varname<0.9.0,>=0.8.1->dataprep) (0.8.3)
Requirement already satisfied: asttokens<3.0.0,>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from varname<0.9.0,>=0.8.1->dataprep) (2.0.8)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.8/dist-packages (from matplotlib->lime) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->lime) (1.4.4)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.8/dist-packages (from matplotlib->lime) (4.37.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from matplotlib->lime) (1.0.5)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.8/dist-packages (from numba->shap) (0.41.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/dist-packages (from importlib-metadata>=3.6.0->flask<3,>=2->dataprep) (3.8.1)
Requirement already satisfied: prompt-toolkit<3.1.0,>3.0.1 in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (3.0.31)
Requirement already satisfied: jedi>=0.16 in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.18.1)
Requirement already satisfied: matplotlib-inline in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.1.6)
Requirement already satisfied: decorator in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (5.1.1)
Requirement already satisfied: pexpect>4.3 in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (4.8.0)
Requirement already satisfied: stack-data in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.5.0)
Requirement already satisfied: pygments>=2.4.0 in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (2.13.0)
Requirement already satisfied: backcall in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.2.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.8/dist-packages (from ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.7.5)
Requirement already satisfied: locket in /usr/local/lib/python3.8/dist-packages (from partd>=0.3.10->dask[array,dataframe,delayed]>=2022.3.0->dataprep) (1.0.0)
Requirement already satisfied: notebook>=4.4.1 in /usr/local/lib/python3.8/dist-packages (from widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (6.4.12)
Requirement already satisfied: idna>=2.0 in /usr/local/lib/python3.8/dist-packages (from yarl<2.0,>=1.0->aiohttp<4.0,>=3.6->dataprep) (3.4)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /usr/local/lib/python3.8/dist-packages (from jedi>=0.16->ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.8.3)
Requirement already satisfied: jupyter-core>=4.6.1 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (4.11.1)
Requirement already satisfied: prometheus-client in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.14.1)
Requirement already satisfied: Send2Trash>=1.8.0 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.8.0)
Requirement already satisfied: jupyter-client>=5.3.4 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (7.3.4)
Requirement already satisfied: ipykernel in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (6.15.3)
Requirement already satisfied: argon2-cffi in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (21.3.0)
Requirement already satisfied: nbformat in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (5.5.0)
Requirement already satisfied: nbconvert>=5 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (7.0.0)
Requirement already satisfied: pyzmq>=17 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (24.0.0)
Requirement already satisfied: terminado>=0.8.3 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.15.0)
Requirement already satisfied: nest-asyncio>=1.5 in /usr/local/lib/python3.8/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.5.5)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.8/dist-packages (from pexpect>4.3->ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.8/dist-packages (from prompt-toolkit<3.1.0,>3.0.1->ipython>=4.0.0->ipywidgets<8.0,>=7.5->dataprep) (0.2.5)
Requirement already satisfied: entrypoints in /usr/local/lib/python3.8/dist-packages (from jupyter-client>=5.3.4->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.4)
Requirement already satisfied: tinycss2 in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.1.1)
Requirement already satisfied: bleach in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (5.0.1)
Requirement already satisfied: mistune<3,>=2.0.3 in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (2.0.4)
Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (4.11.1)
Requirement already satisfied: lxml in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (4.9.1)
Requirement already satisfied: nbclient>=0.5.0 in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.6.8)
Requirement already satisfied: jupyterlab-pygments in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.2.2)
Requirement already satisfied: pandocfilters>=1.4.1 in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.5.0)
Requirement already satisfied: defusedxml in /usr/local/lib/python3.8/dist-packages (from nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.7.1)
Requirement already satisfied: fastjsonschema in /usr/local/lib/python3.8/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (2.16.2)
Requirement already satisfied: jsonschema>=2.6 in /usr/local/lib/python3.8/dist-packages (from nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (4.16.0)
Requirement already satisfied: argon2-cffi-bindings in /usr/local/lib/python3.8/dist-packages (from argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (21.2.0)
Requirement already satisfied: psutil in /usr/local/lib/python3.8/dist-packages (from ipykernel->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (5.9.2)
Requirement already satisfied: debugpy>=1.0 in /usr/local/lib/python3.8/dist-packages (from ipykernel->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.6.3)
Requirement already satisfied: pkgutil-resolve-name>=1.3.10 in /usr/local/lib/python3.8/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.3.10)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /usr/local/lib/python3.8/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.18.1)
Requirement already satisfied: importlib-resources>=1.4.0 in /usr/local/lib/python3.8/dist-packages (from jsonschema>=2.6->nbformat->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (5.9.0)
Requirement already satisfied: cffi>=1.0.1 in /usr/local/lib/python3.8/dist-packages (from argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (1.15.1)
Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.8/dist-packages (from beautifulsoup4->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (2.3.2.post1)
Requirement already satisfied: webencodings in /usr/local/lib/python3.8/dist-packages (from bleach->nbconvert>=5->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (0.5.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.8/dist-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi->notebook>=4.4.1->widgetsnbextension~=3.6.9->ipywidgets<8.0,>=7.5->dataprep) (2.21)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 177, in emit
    self.console.print(renderable, overflow="ignore", crop=False, style=style)
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1673, in print
    extend(render(renderable, render_options))
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1305, in render
    for render_output in iter_render:
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 134, in __rich_console__
    for line in lines:
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/segment.py", line 249, in split_lines
    for segment in segments:
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/console.py", line 1283, in render
    renderable = rich_cast(renderable)
  File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/rich/protocol.py", line 36, in rich_cast
    renderable = cast_method()
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/self_outdated_check.py", line 130, in __rich__
    pip_cmd = get_best_invocation_for_this_pip()
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/entrypoints.py", line 58, in get_best_invocation_for_this_pip
    if found_executable and os.path.samefile(
  File "/usr/lib/python3.8/genericpath.py", line 101, in samefile
    s2 = os.stat(f2)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/bin/pip3.8'
Call stack:
  File "/usr/local/bin/pip", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/main.py", line 70, in main
    return command.main(cmd_args)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/base_command.py", line 101, in main
    return self._main(args)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/base_command.py", line 223, in _main
    self.handle_pip_version_check(options)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/cli/req_command.py", line 190, in handle_pip_version_check
    pip_self_version_check(session, options)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/self_outdated_check.py", line 236, in pip_self_version_check
    logger.warning("[present-rich] %s", upgrade_prompt)
  File "/usr/lib/python3.8/logging/__init__.py", line 1458, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python3.8/logging/__init__.py", line 1589, in _log
    self.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
    self.emit(record)
  File "/usr/local/lib/python3.8/dist-packages/pip/_internal/utils/logging.py", line 179, in emit
    self.handleError(record)
Message: '[present-rich] %s'
Arguments: (UpgradePrompt(old='22.2.2', new='24.2'),)

1.2 Importing Necessary Libraries:¶

This block imports all the key libraries required for the assignment:

  • pandas: For data manipulation and analysis.
  • autosklearn: Automated machine learning library used for building and training models.
  • dataprep.eda: Used for generating exploratory data analysis (EDA) reports.
  • sklearn.preprocessing: Provides tools like StandardScaler for feature scaling.
  • sklearn.model_selection: Used for splitting the dataset into training and test sets.
  • pickle: For saving (dumping) and loading models or other objects.
  • lime and shap: XAI tools used for model interpretability, helping us explain the model’s predictions.
  • NumPy: For numerical operations.
In [27]:
import pandas as pd
import autosklearn
from dataprep.eda import create_report
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from pickle import dump, load
import lime
import lime.lime_tabular
import numpy as np
import shap


import warnings
warnings.filterwarnings('ignore')

print(autosklearn.__version__)
0.15.0

1.3 Loading Dataset and Generating EDA Report:¶

  1. Load the Dataset: The liver disease dataset is loaded into a pandas DataFrame (df) from a CSV file (liver_disease_1.csv).
  2. Exploratory Data Analysis (EDA):
    • The create_report() function from dataprep.eda is used to generate an automated exploratory data analysis report for the dataset.
    • This report provides detailed insights into the data, including statistics, distributions, and relationships between features.
  3. Saving the EDA Report: • The report is saved as an HTML file (Liver_disease_dataset_eda_report.html) for further review and analysis.
In [24]:
df = pd.read_csv("liver_disease_1.csv")
df.head()
Out[24]:
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
0 65 0.7 0.1 187 16 18 6.8 3.3 0.90 Yes
1 62 10.9 5.5 699 64 100 7.5 3.2 0.74 Yes
2 62 7.3 4.1 490 60 68 7.0 3.3 0.89 Yes
3 58 1.0 0.4 182 14 20 6.8 3.4 1.00 Yes
4 72 3.9 2.0 195 27 59 7.3 2.4 0.40 Yes
In [29]:
report = create_report(df, title='Liver Disease Dataset EDA Report')
report.save('Liver_disease_dataset_eda_report.html')
report
  0%|                                                                                                         …
Report has been saved to Liver_disease_dataset_eda_report.html!
Out[29]:
Liver Disease Dataset EDA Report
Liver Disease Dataset EDA Report Overview
Variables ≡
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
Interactions Correlations Missing Values

Overview

Dataset Statistics

Number of Variables 10
Number of Rows 583
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 13
Duplicate Rows (%) 2.2%
Total Size in Memory 75.1 KB
Average Row Size in Memory 131.9 B
Variable Types
  • Numerical: 9
  • Categorical: 1

Dataset Insights

Total_Bilirubin is skewed Skewed
Direct_Bilirubin is skewed Skewed
Alkaline_Phosphotase is skewed Skewed
Alamine_Aminotransferase is skewed Skewed
Aspartate_Aminotransferase is skewed Skewed
Albumin_and_Globulin_Ratio is skewed Skewed
Dataset has 13 (2.23%) duplicate rows Duplicates

Variables


Age

numerical

Approximate Distinct Count 72
Approximate Unique (%) 12.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 44.7461
Minimum 4
Maximum 90
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Age is skewed left (γ1 = -0.0293)

Quantile Statistics

Minimum 4
5-th Percentile 18
Q1 33
Median 45
Q3 58
95-th Percentile 72
Maximum 90
Range 86
IQR 25

Descriptive Statistics

Mean 44.7461
Standard Deviation 16.1898
Variance 262.1107
Sum 26087
Skewness -0.02931
Kurtosis -0.5655
Coefficient of Variation 0.3618

Total_Bilirubin

numerical

Approximate Distinct Count 113
Approximate Unique (%) 19.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 3.2988
Minimum 0.4
Maximum 75
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Total_Bilirubin is skewed right (γ1 = 4.8948)

Quantile Statistics

Minimum 0.4
5-th Percentile 0.6
Q1 0.8
Median 1
Q3 2.6
95-th Percentile 16.35
Maximum 75
Range 74.6
IQR 1.8

Descriptive Statistics

Mean 3.2988
Standard Deviation 6.2095
Variance 38.5582
Sum 1923.2
Skewness 4.8948
Kurtosis 36.8356
Coefficient of Variation 1.8824
  • Total_Bilirubin is not normally distributed (p-value 3.0889323140925768e-24)
  • Total_Bilirubin has 84 outliers

Direct_Bilirubin

numerical

Approximate Distinct Count 80
Approximate Unique (%) 13.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 1.4861
Minimum 0.1
Maximum 19.7
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Direct_Bilirubin is skewed right (γ1 = 3.2041)

Quantile Statistics

Minimum 0.1
5-th Percentile 0.1
Q1 0.2
Median 0.3
Q3 1.3
95-th Percentile 8.4
Maximum 19.7
Range 19.6
IQR 1.1

Descriptive Statistics

Mean 1.4861
Standard Deviation 2.8085
Variance 7.8877
Sum 866.4
Skewness 3.2041
Kurtosis 11.2451
Coefficient of Variation 1.8898
  • Direct_Bilirubin is not normally distributed (p-value 1.07825122604191e-23)
  • Direct_Bilirubin has 81 outliers

Alkaline_Phosphotase

numerical

Approximate Distinct Count 263
Approximate Unique (%) 45.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 290.5763
Minimum 63
Maximum 2110
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Alkaline_Phosphotase is skewed right (γ1 = 3.7554)

Quantile Statistics

Minimum 63
5-th Percentile 137
Q1 175.5
Median 208
Q3 298
95-th Percentile 698.1
Maximum 2110
Range 2047
IQR 122.5

Descriptive Statistics

Mean 290.5763
Standard Deviation 242.938
Variance 59018.8666
Sum 169406
Skewness 3.7554
Kurtosis 17.5907
Coefficient of Variation 0.8361
  • Alkaline_Phosphotase is not normally distributed (p-value 1.978462946138677e-15)
  • Alkaline_Phosphotase has 69 outliers

Alamine_Aminotransferase

numerical

Approximate Distinct Count 152
Approximate Unique (%) 26.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 80.7136
Minimum 10
Maximum 2000
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Alamine_Aminotransferase is skewed right (γ1 = 6.5323)

Quantile Statistics

Minimum 10
5-th Percentile 15
Q1 23
Median 35
Q3 60.5
95-th Percentile 232
Maximum 2000
Range 1990
IQR 37.5

Descriptive Statistics

Mean 80.7136
Standard Deviation 182.6204
Variance 33350.1944
Sum 47056
Skewness 6.5323
Kurtosis 50.1364
Coefficient of Variation 2.2626
  • Alamine_Aminotransferase is not normally distributed (p-value 1.3922521760227646e-23)
  • Alamine_Aminotransferase has 73 outliers

Aspartate_Aminotransferase

numerical

Approximate Distinct Count 177
Approximate Unique (%) 30.4%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 109.9108
Minimum 10
Maximum 4929
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Aspartate_Aminotransferase is skewed right (γ1 = 10.519)

Quantile Statistics

Minimum 10
5-th Percentile 15.1
Q1 25
Median 42
Q3 87
95-th Percentile 400.9
Maximum 4929
Range 4919
IQR 62

Descriptive Statistics

Mean 109.9108
Standard Deviation 288.9185
Variance 83473.9164
Sum 64078
Skewness 10.519
Kurtosis 149.6184
Coefficient of Variation 2.6287
  • Aspartate_Aminotransferase is not normally distributed (p-value 9.535055700647434e-25)
  • Aspartate_Aminotransferase has 66 outliers

Total_Protiens

numerical

Approximate Distinct Count 58
Approximate Unique (%) 9.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 6.4832
Minimum 2.7
Maximum 9.6
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Total_Protiens is skewed left (γ1 = -0.2849)

Quantile Statistics

Minimum 2.7
5-th Percentile 4.61
Q1 5.8
Median 6.6
Q3 7.2
95-th Percentile 8.1
Maximum 9.6
Range 6.9
IQR 1.4

Descriptive Statistics

Mean 6.4832
Standard Deviation 1.0855
Variance 1.1782
Sum 3779.7
Skewness -0.2849
Kurtosis 0.2208
Coefficient of Variation 0.1674
  • Total_Protiens is not normally distributed (p-value 0.000297306684315054)
  • Total_Protiens has 8 outliers

Albumin

numerical

Approximate Distinct Count 40
Approximate Unique (%) 6.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 3.1419
Minimum 0.9
Maximum 5.5
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Albumin is skewed left (γ1 = -0.0436)

Quantile Statistics

Minimum 0.9
5-th Percentile 1.8
Q1 2.6
Median 3.1
Q3 3.8
95-th Percentile 4.39
Maximum 5.5
Range 4.6
IQR 1.2

Descriptive Statistics

Mean 3.1419
Standard Deviation 0.7955
Variance 0.6329
Sum 1831.7
Skewness -0.04357
Kurtosis -0.3949
Coefficient of Variation 0.2532

Albumin_and_Globulin_Ratio

numerical

Approximate Distinct Count 69
Approximate Unique (%) 11.8%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 9328
Mean 0.947
Minimum 0.3
Maximum 2.8
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • Albumin_and_Globulin_Ratio is skewed right (γ1 = 0.9896)

Quantile Statistics

Minimum 0.3
5-th Percentile 0.5
Q1 0.7
Median 0.93
Q3 1.1
95-th Percentile 1.5
Maximum 2.8
Range 2.5
IQR 0.4

Descriptive Statistics

Mean 0.947
Standard Deviation 0.3189
Variance 0.1017
Sum 552.13
Skewness 0.9896
Kurtosis 3.2583
Coefficient of Variation 0.3367
  • Albumin_and_Globulin_Ratio is not normally distributed (p-value 1.2459763665111632e-10)
  • Albumin_and_Globulin_Ratio has 10 outliers

Dataset

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.3%
Missing 0
Missing (%) 0.0%
Memory Size 39477
  • The largest value (Yes) is over 2.49 times larger than the second largest value (No)

Length

Mean 2.7136
Standard Deviation 0.4525
Median 3
Minimum 2
Maximum 3

Sample

1st row Yes
2nd row Yes
3rd row Yes
4th row Yes
5th row Yes

Letter

Count 1582
Lowercase Letter 999
Space Separator 0
Uppercase Letter 583
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (Yes, No) take over 50.0%
  • The largest value (yes) is over 2.49 times larger than the second largest value (no)

Interactions

Correlations

Missing Values

Report generated with DataPrep

1.4 Data Preprocessing and Oversampling for Class Imbalance:¶

  1. Random Oversampling Function:
    • The random_oversample() function performs random oversampling on the minority class in the training data to address class imbalance. It duplicates the minority class instances to match the size of the majority class, helping to create a balanced dataset.
  2. Handling Missing Values:
    • Missing values in the Albumin_and_Globulin_Ratio column are forward-filled (ffill), ensuring that no null values remain in the dataset.
  3. Data Splitting:
    • The dataset is split into features (X) and target labels (y), where the target label indicates whether the patient has liver disease (1 for “Yes”, 0 for “No”).
    • The dataset is split into training and test sets using train_test_split(), with 20% of the data reserved for testing. Stratified sampling is used to maintain the proportion of liver disease cases in both training and test sets, preventing any potential data leakage.
  4. Class Imbalance Handling:
    • Random oversampling is applied to the training data using the random_oversample() function, which helps mitigate the imbalance between liver disease and non-liver disease cases.
  5. Feature Scaling:
    • Standard scaling is applied to the training features using StandardScaler(), ensuring that all features are on a similar scale, which improves the model’s performance.
    • The scaled data is saved to a new DataFrame X_train_final.
  6. Saving the Scaler:
    • The fitted scaler is saved as a pickle file (scaler.pkl) to ensure that the same scaling can be applied to the test data later.

This preprocessing step is crucial to ensure that the model is trained on balanced and standardized data, minimizing bias and improving generalization.

In [25]:
def random_oversample(X_train, y_train, target_column='target'):
    """
    Perform random oversampling on the minority class to balance the dataset.

    Parameters:
    X_train (pd.DataFrame): Feature data.
    y_train (pd.Series): Target labels.
    target_column (str): The name of the target column to be used in the combined DataFrame. Default is 'target'.

    Returns:
    pd.DataFrame, pd.Series: Resampled X_train and y_train.
    """
    # Combine X_train and y_train for resampling
    X_train[target_column] = y_train

    # Separate minority and majority classes
    minority_class = X_train[X_train[target_column] == 1]
    majority_class = X_train[X_train[target_column] == 0]

    # Perform random over-sampling on the minority class
    minority_oversampled = minority_class.sample(n=len(majority_class), replace=True, random_state=42)

    # Combine the oversampled minority class with the majority class
    oversampled_data = pd.concat([majority_class, minority_oversampled])

    # Separate X_train and y_train again
    X_train_resampled = oversampled_data.drop(target_column, axis=1)
    y_train_resampled = oversampled_data[target_column]
    
    return X_train_resampled, y_train_resampled


df["Albumin_and_Globulin_Ratio"].ffill(inplace=True)

test_ratio = 0.2

# Before we apply any feature engineering technique like upsampling or normalisation we need to split out dataset btw train and test to avoid data leakage.

X = df.iloc[:,:-1]
y = df.iloc[:,[-1]]

y = y["Dataset"].map({"Yes":1, "No":0}) # Yes means patient has liver disease.

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y , test_size=test_ratio , random_state=100) 

# Fixing class imabalnce using SMOTE
X_train, y_train = random_oversample(X_train, y_train)
# lets do standard scalling of all other features so that we have all the features before we put into model.

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(X_train)

X_train_final = pd.DataFrame(x_train_scaled, columns=X_train.columns)


dump(scaler, open('./scaler.pkl', 'wb'))

2. Model Training, Hyperparameter Tuning, and Selection Using Auto-sklearn¶

2.1 Task Overview:¶

The goal of this task was to train multiple models, tune hyperparameters, and select the best-performing model for predicting liver disease. To automate the model selection and hyperparameter tuning process, we used AutoML via the Auto-sklearn library, which automates the entire workflow, including data preprocessing, model training, and hyperparameter optimization.

2.2 Experimentation Process:¶

  1. Auto-sklearn Setup:
    • Auto-sklearn was configured with a total time limit of 180 seconds for the entire task and a 40-second time limit per run. This allowed Auto-sklearn to test various models and configurations within a defined time frame.
    • Auto-sklearn automatically handled tasks like data preprocessing (including feature scaling), model selection, and hyperparameter tuning across various algorithms.
  2. Generated Models:
    • Auto-sklearn trained several models, including Random Forest, AdaBoost, Extra Trees, Passive-Aggressive Classifier, MLPClassifier, and Linear Discriminant Analysis (LDA).
    • Each model was tested with different hyperparameters and preprocessing techniques to optimize performance.
    • The top-performing models were ranked based on their cost (lower is better) and ensemble weight (how important they are in the final ensemble model).
  3. Top Models:
    • RandomForestClassifier: Ranked 1st, it achieved the lowest cost and was included in the final ensemble with an ensemble weight of 0.02.
    • AdaBoostClassifier: Featured multiple times in the top 5 models, with different hyperparameters and base estimators.
    • ExtraTreesClassifier: Appeared frequently in the top models list, with varying hyperparameter settings.
    • PassiveAggressiveClassifier: Ranked 3rd, contributing to the ensemble model.
  4. Ensemble Model:
    • Auto-sklearn automatically created an ensemble of the best models based on their performance. The ensemble model combined predictions from multiple models (like RandomForest and AdaBoost) to create a more robust final prediction.

2.3 Justification for Model Choice:¶

  • RandomForestClassifier was chosen as the best-performing model based on its rank (1st), low cost (0.18), and inclusion in the final ensemble. Random Forest is well-known for its strong performance in classification tasks and its robustness to overfitting, making it a suitable choice for our liver disease dataset.
  • AdaBoostClassifier also performed well and was included in the final ensemble with different configurations. This demonstrates that boosting methods can effectively enhance model performance by focusing on difficult-to-classify instances.
  • ExtraTreesClassifier appeared frequently in the top models list, reinforcing its importance in the ensemble. Extra Trees is known for its ability to handle high-dimensional data, which may have contributed to its success.

By using Auto-sklearn, we were able to automate the model training and hyperparameter tuning process, testing a wide range of algorithms and configurations within a limited time frame. The final ensemble model, consisting of RandomForest, AdaBoost, Extra Trees, and other classifiers, provides a well-balanced approach to predicting liver disease. The ability to automatically select and combine models ensures robust performance and minimizes overfitting.

In [6]:
import autosklearn.classification as auto_classifier

autoclassifier = auto_classifier.AutoSklearnClassifier(time_left_for_this_task=180, 
                                                       per_run_time_limit=40)
autoclassifier.fit(X_train_final, y_train)
autoclassifier.show_models()
Out[6]:
{10: {'model_id': 10,
  'rank': 1,
  'cost': 0.18181818181818177,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff08a2880>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fff8177f640>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7ffff0543d60>,
  'sklearn_classifier': RandomForestClassifier(criterion='entropy', max_features=2, n_estimators=512,
                         n_jobs=1, random_state=1, warm_start=True)},
 12: {'model_id': 12,
  'rank': 2,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fff9027bee0>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fff88745430>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fff88745fa0>,
  'sklearn_classifier': AdaBoostClassifier(algorithm='SAMME',
                     base_estimator=DecisionTreeClassifier(max_depth=2),
                     learning_rate=0.13167493237005792, n_estimators=56,
                     random_state=1)},
 16: {'model_id': 16,
  'rank': 3,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fff9022fee0>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fff8878a310>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fff8878abb0>,
  'sklearn_classifier': PassiveAggressiveClassifier(C=0.14833233294431605, average=True,
                              loss='squared_hinge', max_iter=16, random_state=1,
                              tol=0.00016482166646253793, warm_start=True)},
 17: {'model_id': 17,
  'rank': 4,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fff8873d340>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fff9005d730>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fff9005da00>,
  'sklearn_classifier': AdaBoostClassifier(algorithm='SAMME',
                     base_estimator=DecisionTreeClassifier(max_depth=2),
                     learning_rate=0.03734246906377268, n_estimators=416,
                     random_state=1)},
 21: {'model_id': 21,
  'rank': 5,
  'cost': 0.18181818181818177,
  'ensemble_weight': 0.04,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff075e040>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7ffff0254be0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7ffff0254dc0>,
  'sklearn_classifier': ExtraTreesClassifier(criterion='entropy', max_features=44, min_samples_leaf=2,
                       min_samples_split=20, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
 24: {'model_id': 24,
  'rank': 6,
  'cost': 0.13636363636363635,
  'ensemble_weight': 0.1,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fff902c6460>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7ffff017d610>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7ffff017d7f0>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=2, min_samples_split=4, n_estimators=512,
                       n_jobs=1, random_state=1, warm_start=True)},
 29: {'model_id': 29,
  'rank': 7,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.08,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff02391c0>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb1a7cf40>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7ffff05fd0a0>,
  'sklearn_classifier': MLPClassifier(alpha=0.0007119897774330087, beta_1=0.999, beta_2=0.9,
                hidden_layer_sizes=(51, 51, 51),
                learning_rate_init=0.00028079049815589414, max_iter=128,
                n_iter_no_change=32, random_state=1, validation_fraction=0.0,
                verbose=0, warm_start=True)},
 34: {'model_id': 34,
  'rank': 8,
  'cost': 0.23863636363636365,
  'ensemble_weight': 0.12,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff01f1160>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb18bed30>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb173db50>,
  'sklearn_classifier': ExtraTreesClassifier(bootstrap=True, max_features=1, min_samples_leaf=14,
                       min_samples_split=14, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
 46: {'model_id': 46,
  'rank': 9,
  'cost': 0.23863636363636365,
  'ensemble_weight': 0.08,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff01a6520>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb175c370>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb175c610>,
  'sklearn_classifier': ExtraTreesClassifier(bootstrap=True, criterion='entropy', max_features=1,
                       min_samples_leaf=16, min_samples_split=5, n_estimators=512,
                       n_jobs=1, random_state=1, warm_start=True)},
 58: {'model_id': 58,
  'rank': 10,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.1,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb1950c40>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb14682b0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb14685b0>,
  'sklearn_classifier': ExtraTreesClassifier(criterion='entropy', max_features=6, min_samples_leaf=16,
                       min_samples_split=20, n_estimators=512, n_jobs=1,
                       random_state=1, warm_start=True)},
 65: {'model_id': 65,
  'rank': 11,
  'cost': 0.1477272727272727,
  'ensemble_weight': 0.08,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb1887640>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb12bd250>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb12bd700>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=2, min_samples_split=7, n_estimators=512,
                       n_jobs=1, random_state=1, warm_start=True)},
 66: {'model_id': 66,
  'rank': 12,
  'cost': 0.2272727272727273,
  'ensemble_weight': 0.12,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb172c850>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb10b77f0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb10b7be0>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=3, min_samples_leaf=20, min_samples_split=17,
                       n_estimators=512, n_jobs=1, random_state=1,
                       warm_start=True)},
 74: {'model_id': 74,
  'rank': 13,
  'cost': 0.2272727272727273,
  'ensemble_weight': 0.04,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb13b9c70>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb0ed3160>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb0ed35b0>,
  'sklearn_classifier': LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr',
                             tol=0.011632803126809681)},
 75: {'model_id': 75,
  'rank': 14,
  'cost': 0.21590909090909094,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb129dd00>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb0c833d0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb0c83640>,
  'sklearn_classifier': PassiveAggressiveClassifier(C=0.01771591080165321, average=True, max_iter=16,
                              random_state=1, tol=9.18644810240989e-05,
                              warm_start=True)},
 79: {'model_id': 79,
  'rank': 15,
  'cost': 0.15909090909090906,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb0fee1c0>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb137a310>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb137a160>,
  'sklearn_classifier': ExtraTreesClassifier(bootstrap=True, criterion='entropy', max_features=2,
                       n_estimators=512, n_jobs=1, random_state=1,
                       warm_start=True)},
 89: {'model_id': 89,
  'rank': 16,
  'cost': 0.13636363636363635,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb0d9c2b0>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fff887ac670>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fff887ac8e0>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=2, min_samples_split=8, n_estimators=512,
                       n_jobs=1, random_state=1, warm_start=True)},
 92: {'model_id': 92,
  'rank': 17,
  'cost': 0.17045454545454541,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb0bd1760>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb08702b0>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb0870e20>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=2, min_samples_split=18, n_estimators=512,
                       n_jobs=1, random_state=1, warm_start=True)},
 93: {'model_id': 93,
  'rank': 18,
  'cost': 0.20454545454545459,
  'ensemble_weight': 0.02,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7fffb13c9940>,
  'balancing': Balancing(random_state=1, strategy='weighting'),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb0613640>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb0613a90>,
  'sklearn_classifier': LinearDiscriminantAnalysis(shrinkage=0.6627083162415924, solver='lsqr',
                             tol=0.06631009386339572)},
 94: {'model_id': 94,
  'rank': 19,
  'cost': 0.20454545454545459,
  'ensemble_weight': 0.06,
  'data_preprocessor': <autosklearn.pipeline.components.data_preprocessing.DataPreprocessorChoice at 0x7ffff05273d0>,
  'balancing': Balancing(random_state=1),
  'feature_preprocessor': <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice at 0x7fffb0488f70>,
  'classifier': <autosklearn.pipeline.components.classification.ClassifierChoice at 0x7fffb04ae430>,
  'sklearn_classifier': ExtraTreesClassifier(max_features=3, min_samples_leaf=2, min_samples_split=7,
                       n_estimators=512, n_jobs=1, random_state=1,
                       warm_start=True)}}
In [10]:
dump(autoclassifier, open('autoclassifier.pkl', 'wb'))
In [26]:
######################   PREDICTION USING AUTOCLASSIFIER  ###################### 
from sklearn.metrics import accuracy_score
scaler = load(open('scaler.pkl', 'rb'))
X_test_scaled = scaler.transform(X_test)

X_test_final = pd.DataFrame(X_test_scaled, columns=X_test.columns)

model = load(open('autoclassifier.pkl', 'rb'))
prediction = model.predict(X_test_final)
score = accuracy_score(y_test, prediction)
print(f"Accuracy : {score}")
Accuracy : 0.6153846153846154

3. Explainable AI (XAI) Implementation¶

3.1 SHAP Summary Plot Explanation for Liver Disease Prediction:¶

The SHAP (Shapley Additive Explanations) summary plot shows how the features in the liver disease dataset contribute to the model’s predictions for the entire test data. Each dot represents a SHAP value for a specific feature and patient, and the color of the dot (ranging from blue to red) represents the feature’s value (low to high). The SHAP values on the X-axis indicate whether a feature increases or decreases the likelihood of predicting liver disease.

3.1.1 Key Insights:¶

  • Albumin: High Albumin values (red) contribute negatively to the prediction of liver disease, meaning that higher levels of albumin reduce the likelihood of liver disease (negative SHAP values). Conversely, lower values (blue) increase the probability of liver disease.
  • Aspartate_Aminotransferase (AST): Higher values (red) increase the likelihood of liver disease, as indicated by the positive SHAP values. AST is an enzyme linked to liver function, and elevated levels suggest liver damage, increasing the likelihood of the disease.
  • Total_Bilirubin and Direct_Bilirubin: Both these bilirubin levels show that higher values (red) contribute positively to liver disease predictions, indicating that higher bilirubin levels strongly increase the probability of liver dysfunction.
  • Alkaline_Phosphatase: Higher values (red) of this enzyme also push the prediction towards liver disease, while lower values (blue) reduce the likelihood of the disease.
  • Total_Proteins: Lower levels (blue) of total proteins increase the likelihood of liver disease, while higher levels (red) reduce it, as higher protein levels generally indicate healthier liver function.
  • Age: Older age (red) contributes positively to predicting liver disease, while younger age (blue) lowers the probability. This aligns with the increased risk of liver disease as age progresses.
  • Albumin_and_Globulin_Ratio: A higher ratio (red) tends to decrease the likelihood of liver disease, while lower ratios (blue) contribute positively to predicting liver disease.

3.1.2 Overall Interpretation:¶

  • High values of certain features like Aspartate_Aminotransferase, Bilirubin, Alkaline_Phosphatase, and Age are strong indicators that push the model towards predicting liver disease.
  • Low values of features like Albumin, Total_Proteins, and Albumin_and_Globulin_Ratio push the model towards predicting liver disease as well.

This SHAP summary plot provides a comprehensive view of how each feature in the liver disease dataset impacts the model’s predictions for the test data. By analyzing the SHAP values and feature contributions, we can better understand the factors driving the model’s predictions.

In [17]:
# Initialize the SHAP explainer for the Auto-sklearn model
explainer = shap.Explainer(model.predict, X_test_final)

# Generate SHAP values for the test data
shap_values = explainer(X_test_final)

# Generate a SHAP summary plot
shap.summary_plot(shap_values, X_test_final, feature_names=X_test.columns)
ExactExplainer explainer: 118it [24:56, 12.79s/it]                                                                                                                                                                                              

3.2 LIME Explanation for Liver Disease Prediction:¶

The LIME (Local Interpretable Model-agnostic Explanations) output provides an explanation for an individual prediction made by the model regarding the presence or absence of liver disease. This explanation helps us understand which features contributed most to the model’s decision for this particular patient.

3.2.1 Key Components:¶

  1. Prediction Probabilities:
    • The model predicts a 56% probability for “No Liver Disease” and a 44% probability for “Liver Disease”. This indicates that the model slightly favors the prediction of “No Liver Disease” for this instance.
  2. Features Contributing to “No Liver Disease”:
    • Total_Proteins: With a value of 5.80, this feature strongly contributes to the prediction of “No Liver Disease” (1.92 impact).
    • Albumin: A value of 2.60 positively impacts the prediction of “No Liver Disease” (1.42 impact).
    • **Albumin_and_Globulin_Ratio: A small positive contribution (0.11 impact), suggesting this feature slightly pushes the prediction towards “No Liver Disease”.
    • Age: The age of the patient (32.00) also contributes positively to predicting “No Liver Disease”, but with a minimal impact (0.04).
  3. Features Contributing to “Liver Disease”:
    • Aspartate_Aminotransferase (AST): This feature, with a value of -0.20, slightly increases the likelihood of liver disease.
    • Alamine_Aminotransferase: A value of -0.18 also contributes towards predicting “Liver Disease”.
    • Direct_Bilirubin and Total_Bilirubin: These two bilirubin levels (-0.39 and -0.38 respectively) indicate an elevated likelihood of liver disease, suggesting some liver dysfunction.
    • Alkaline_Phosphotase: A value of -0.17 contributes to the prediction of liver disease.

3.2.2 Overall Interpretation:¶

  • The model’s final decision slightly favors “No Liver Disease” based on a combination of feature contributions. Features like Total_Proteins, Albumin, and Albumin_and_Globulin_Ratio strongly push the prediction towards “No Liver Disease”, while liver enzyme levels (such as AST and Bilirubin) provide opposing contributions, increasing the likelihood of “Liver Disease”. However, the positive impact of protein-related features outweighs the negative impact of the liver enzyme levels in this case.

This LIME explanation offers clear insights into why the model predicted “No Liver Disease” for this patient by highlighting the relative importance of specific features.

In [30]:
# Create a LIME explainer
# Use the training data to initialize LIME explainer to capture feature distributions
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,  # Training data
    feature_names=X_train.columns,  # Feature names
    class_names=['No Liver Disease', 'Liver Disease'],  # Class labels
    mode='classification'  # We are in classification mode
)

# Define a prediction function for LIME that directly uses scaled data
def predict_fn(data):
    # Convert NumPy array back to DataFrame
    data_df = pd.DataFrame(data, columns=X_test_final.columns)  # Ensure correct column names
    return model.predict_proba(data_df)  # Predict probabilities

# Pick a specific instance from the scaled test data (e.g., first row)
sample_idx = 0  # Change this index to explain different samples

# Generate LIME explanation for this scaled instance
exp = explainer.explain_instance(X_test_final.iloc[sample_idx].values, predict_fn)

# Show the LIME explanation in the notebook
exp.show_in_notebook(show_all=False)

# Optionally, save the explanation as an HTML file
exp.save_to_file(f"lime_explanation_{sample_idx}.html")

3.3 Insights into the Model’s Decision-Making Process:¶

Through this exercise using SHAP and LIME, we were able to gain clear insights into the decision-making process of the liver disease prediction model. Both XAI (Explainable AI) tools provided a deep understanding of how specific features contributed to the model’s output for individual predictions as well as the overall feature importance.

3.3.1 Feature Importance and Model Interpretability:¶

  1. SHAP (Shapley Additive Explanations):
    • Global Insights: SHAP provided a holistic view of how features like Aspartate_Aminotransferase, Bilirubin, and Alkaline_Phosphotase contribute to the model’s predictions across the entire test dataset. The summary plot showed that high values of these features pushed the model toward predicting liver disease, while features like Albumin and Total_Proteins reduced the likelihood of the disease.
    • Local Insights: SHAP also provided explanations for individual patients, showing how each feature’s value either positively or negatively influenced the model’s prediction. This level of interpretability is important in understanding the role of different liver function indicators for each patient.
  2. LIME (Local Interpretable Model-Agnostic Explanations):
    • Instance-Level Interpretability: LIME helped explain why the model made specific predictions for individual patients. By breaking down each prediction and showing how the combination of features like Albumin, Total_Proteins, and Bilirubin influenced the model’s output, LIME made the decision-making process clear at a local (instance) level.
    • Clear Decision Boundaries: The LIME output also visually highlighted which features pushed the prediction towards “Liver Disease” and which ones leaned towards “No Liver Disease”, helping us understand the balance of contributing factors for each decision.

3.3.2 Importance of Interpretability:¶

Interpretability is crucial in medical contexts like liver disease prediction because it ensures that predictions made by the model can be trusted and validated. Stakeholders such as doctors and healthcare professionals need to understand why a model predicted a certain outcome before making any decisions based on it. By using XAI tools like SHAP and LIME, we provide this necessary transparency, offering:

  1. Trust and Reliability: Knowing which features the model considers important (e.g., liver enzyme levels or proteins) ensures that the model aligns with clinical expectations and that its predictions are reliable.
  2. Actionability: Interpretability allows healthcare professionals to make informed decisions based on the model’s predictions. If a feature such as Bilirubin strongly indicates liver disease, doctors can prioritize further tests based on this.
  3. Error Analysis: These tools also help identify potential biases or errors in the model’s decision-making process. If a particular feature is overly influencing predictions in an unexpected way, we can adjust the model accordingly.

3.3.3 How XAI Tools Helped Achieve Interpretability:¶

  1. SHAP provided a clear picture of how each feature affects the prediction globally and locally, offering a balanced perspective of both general trends and specific cases.
  2. LIME allowed us to drill down into individual predictions, making the model’s decision transparent for each test case. This is especially useful for understanding outliers or edge cases in the dataset.

Both tools complement each other in helping us understand and trust the model, ensuring that the predictions align with real-world medical knowledge, and providing actionable insights to healthcare professionals.